A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include a change of plans, scheduling conflicts, and so on. Cancelling is often made easier by the option to do so free of charge, or preferably at a low cost, which benefits hotel guests but is undesirable and potentially revenue-diminishing for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.
The data contains various attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
Leading Questions:
# Importing libraries for reading and manipulating the data:
import numpy as np
import pandas as pd
# Importing libraries for data visualization:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Removes the limit for the number of displayed columns:
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows:
pd.set_option("display.max_rows", 200)
# Setting the precision of floating-point numbers to 5 decimal places:
pd.set_option("display.float_format", lambda x: "%.5f" % x)
# To build logistic regression model using statsmodels:
import statsmodels.api as sm
# Importing function train_test_split to split the data into train and test:
from sklearn.model_selection import train_test_split
# Importing function variance_inflation_factor to compute VIF:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Metric scores to check model performance:
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
# To build decision tree model:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
# To tune different models
from sklearn.model_selection import GridSearchCV
# Function to create histogram and box plots:
def creating_hist_box(df, feature, kde=True, bins=None, figsize=(10, 4)):
    f2, (ax_hist, ax_box) = plt.subplots(nrows=1, ncols=2, figsize=figsize)
    f2.tight_layout(pad=5.0)
    if bins:
        sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist, bins=bins)
        ax_hist.set_title(f'Histogram with bins = {bins}')
    else:
        sns.histplot(data=df, x=feature, kde=kde, ax=ax_hist)
        ax_hist.set_title('Histogram with default number of bins')
    sns.boxplot(data=df, x=feature, ax=ax_box, showmeans=True, color="violet")
    ax_box.set_title('Boxplot')
# Function to create labeled barplots:
def labeled_barplot(data, feature, perc=False, n=None, rotatn=0):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    rotatn: how to display x labels (default is horizontal (0); use 90 for vertical)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=rotatn, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x position: center of the bar
        y = p.get_height()  # y position: top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage
    plt.show()  # show the plot
# Function to create stacked barplots:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # A single legend call: the original had two, and the second silently overrode the first.
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
# Function to plot distributions w.r.t. target:
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t. target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
# Function to compute different metrics to check the performance of a classification model built using statsmodels:
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # classifying observations whose predicted probability exceeds the threshold as class 1
    pred = (model.predict(predictors) > threshold).astype(int)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# Function to plot the confusion matrix of a classification model built using statsmodels:
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Function to calculate the variance inflation factor (VIF) for each predictor:
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
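As a quick sanity check, the VIF helper above can be exercised on a small synthetic frame (a sketch: the columns `a`, `b`, `c` are illustrative, not from the dataset, and the helper is restated so the snippet runs standalone). A nearly collinear pair should get a large VIF, while an independent column stays near 1.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a * 2 + rng.normal(scale=0.01, size=200)  # nearly collinear with a
c = rng.normal(size=200)                      # independent of a and b
toy = pd.DataFrame({"a": a, "b": b, "c": c})
print(checking_vif(toy))  # a and b get very large VIFs; c stays near 1
```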
# Function to compute different metrics to check the performance of a classification model built using sklearn:
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# Defining a function to plot the confusion matrix of an sklearn model:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
from google.colab import drive
drive.mount('/content/drive')
# Loading the data:
data = pd.read_csv('/content/drive/MyDrive/Univ_Texas/Supervised_Learning_Classification/Project/INNHotelsGroup.csv')
# Displaying first 5 rows of the dataset:
data.head()
# Displaying the last 5 rows of the dataset:
data.tail()
Observation:
# Checking shape of the dataset:
data.shape
print(f'No. of rows in the dataset: {data.shape[0]}')
print(f'No. of columns in the dataset: {data.shape[1]}')
Observation:
# Info table: Checking the datatypes of the columns for the dataset:
data.info()
# List of columns with 'int' datatype:
data.select_dtypes(include='int').columns.to_list()
Observation:
# Checking percentage of missing values:
(data.isnull().sum()/data.shape[0])*100
# Checking for duplicate rows:
data.duplicated().sum()
Observations:
# Statistical summary of the dataset
data.describe().T
# Statistical summary of the dataset for 'object' columns:
data.describe(include='object')
Observations:
# Since 'Booking_ID' is a primary-key column, we drop it for further analysis:
df = data.drop('Booking_ID', axis=1)
df.head()
# Since 'booking_status' is a flag column indicating whether the booking was canceled or not
# we change the 'Not_Canceled' values to 0 and 'Canceled' values to 1 for further analysis
booking_flag = {'Not_Canceled':0, 'Canceled':1}
df['booking_status'] = df['booking_status'].replace(booking_flag)
df.head()
# Histplot and Boxplot to show the distribution of data for the column 'lead_time':
creating_hist_box(df, 'lead_time', bins=40)
df['lead_time'].describe()
Observations:
# Histplot and Boxplot to show the distribution of data for the column 'avg_price_per_room':
creating_hist_box(df, 'avg_price_per_room', bins=40)
df['avg_price_per_room'].describe()
Observations:
df.loc[df['avg_price_per_room']==0].shape
df.loc[df["avg_price_per_room"] == 0, ["market_segment_type"]].value_counts()
Observations:
# Histplot and Boxplot to show the distribution of data for the column 'no_of_previous_cancellations':
creating_hist_box(df, 'no_of_previous_cancellations', bins=40)
df['no_of_previous_cancellations'].describe()
df.loc[df['no_of_previous_cancellations']==0].shape[0]/df.shape[0] * 100
df.loc[df['no_of_previous_cancellations']!=0].shape[0]/df.shape[0] * 100
Observations:
# Histplot and Boxplot to show the distribution of data for the column 'no_of_previous_bookings_not_canceled':
creating_hist_box(df, 'no_of_previous_bookings_not_canceled', bins=40)
df['no_of_previous_bookings_not_canceled'].describe()
Observations:
# Barplot for the column 'no_of_adults' in the dataset:
labeled_barplot(df,'no_of_adults', perc=True)
df['no_of_adults'].value_counts()/df.shape[0] * 100
Observations:
# Barplot for the column 'no_of_children' in the dataset:
labeled_barplot(df, 'no_of_children', perc=True)
df['no_of_children'].value_counts()/df.shape[0] * 100
Observations:
# Barplot for the column 'no_of_week_nights' in the dataset:
labeled_barplot(df, 'no_of_week_nights', perc=True)
(df['no_of_week_nights'].value_counts()/df.shape[0] * 100).head(5)
(df['no_of_week_nights'].value_counts()/df.shape[0] * 100).tail()
Observations:
| No. of Week Nights | Percentage of reservation (%) |
|---|---|
| 2 | 31.55 |
| 1 | 26.16 |
| 3 | 21.61 |
| 4 | 8.24 |
# Barplot for the column 'no_of_weekend_nights' in the dataset:
labeled_barplot(df, 'no_of_weekend_nights', perc=True)
(df['no_of_weekend_nights'].value_counts()/df.shape[0] * 100)
Observations:
| No. of Weekend Nights | Percentage of reservation (%) |
|---|---|
| 0 | 46.51 |
| 1 | 27.55 |
| 2 | 25.01 |
# Barplot for the column 'required_car_parking_space' in the dataset:
labeled_barplot(df, 'required_car_parking_space', perc=True)
df['required_car_parking_space'].value_counts()
(df['required_car_parking_space'].value_counts()/df.shape[0] *100)
Observations:
# Barplot for the column 'type_of_meal_plan' in the dataset:
labeled_barplot(df,'type_of_meal_plan', perc=True, rotatn=90)
round(df['type_of_meal_plan'].value_counts()/df.shape[0] *100, 2)
Observations:
| Meal Plan | Includes | Percentage of reservation (%) |
|---|---|---|
| Not Selected | N/A | 14.14 |
| Meal Plan 1 | Breakfast | 76.73 |
| Meal Plan 2 | Breakfast + Lunch | 9.11 |
| Meal Plan 3 | Breakfast + Lunch + Dinner | 0.01 |
# Barplot for the column 'room_type_reserved' in the dataset:
labeled_barplot(df, 'room_type_reserved', perc=True, rotatn=90)
round(df['room_type_reserved'].value_counts()/df.shape[0] *100, 2)
Observations:
# Barplot for the column 'arrival_month' in the dataset:
labeled_barplot(df, 'arrival_month', perc=True)
df['arrival_month'].value_counts()
(df['arrival_month'].value_counts()/df.shape[0])*100
Observations:
# Barplot for the column 'arrival_year' in the dataset:
labeled_barplot(df, 'arrival_year', perc=True)
Observations:
# Barplot for the column 'arrival_date' in the dataset:
labeled_barplot(df, 'arrival_date', perc=True)
Observations:
# Barplot for the column 'market_segment_type' in the dataset:
labeled_barplot(df, 'market_segment_type', perc=True, rotatn=90)
(df['market_segment_type'].value_counts()/df.shape[0]) *100
Observations:
# Barplot for the column 'no_of_special_requests' in the dataset:
labeled_barplot(df, 'no_of_special_requests', perc=True)
(df['no_of_special_requests'].value_counts()/df.shape[0])*100
Observations:
# Barplot for the column 'booking_status' in the dataset:
labeled_barplot(df, 'booking_status', perc=True)
Observations:
cols_list = df.select_dtypes(include=np.number).columns.tolist()
df_corr = df[cols_list].corr()
df_corr
#Heatmap showing correlation values between different variables.
plt.figure(figsize=(12, 7))
sns.heatmap(df_corr, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Observations:
| Variable 1 | Variable 2 | Correlation Value |
|---|---|---|
| no_of_previous_cancellations | repeated_guest | 0.39 |
| no_of_previous_bookings_not_canceled | repeated_guest | 0.54 |
| no_of_previous_bookings_not_canceled | no_of_previous_cancellations | 0.47 |
| booking_status | lead_time | 0.44 |
| avg_price_per_room | no_of_children | 0.34 |
| avg_price_per_room | no_of_adults | 0.30 |
Average price per room is positively correlated with no_of_children, no_of_adults, no_of_week_nights, required_car_parking_space, arrival_year, arrival_month, arrival_date, no_of_special_requests, and booking_status, and negatively correlated with no_of_weekend_nights, repeated_guest, no_of_previous_cancellations, and no_of_previous_bookings_not_canceled.
The variable booking_status is positively correlated with no_of_adults, no_of_children, no_of_week_nights, no_of_weekend_nights, lead_time, arrival_date, avg_price_per_room, and arrival_year, and negatively correlated with arrival_month, required_car_parking_space, no_of_special_requests, repeated_guest, no_of_previous_bookings_not_canceled, and no_of_previous_cancellations.
Arrival month and arrival year show the strongest negative correlation (-0.34).
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow")
plt.show()
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x="market_segment_type", y="avg_price_per_room")
plt.grid()
plt.show()
df.groupby(['market_segment_type'])['avg_price_per_room'].mean()
Observations:
stacked_barplot(df, "market_segment_type", "booking_status")
Observations:
plt.figure(figsize=(10, 5))
sns.boxplot(data = df, x='no_of_special_requests', y='avg_price_per_room')
plt.show()
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x="no_of_special_requests", y="avg_price_per_room", ci=None)
plt.grid()
plt.show()
df.groupby(['no_of_special_requests'])['avg_price_per_room'].mean()
Observations:
stacked_barplot(df, "no_of_special_requests", "booking_status")
Observations:
distribution_plot_wrt_target(df, "avg_price_per_room", "booking_status")
df.loc[df['booking_status'] ==0,'avg_price_per_room'].describe()
df.loc[df['booking_status'] ==1,'avg_price_per_room'].describe()
Observations:
distribution_plot_wrt_target(df, 'lead_time', 'booking_status')
df.loc[df['booking_status'] ==0,'lead_time'].describe()
df.loc[df['booking_status'] ==1,'lead_time'].describe()
Observations:
family_df = df[(df["no_of_children"] >= 0) & (df["no_of_adults"] > 1)]
family_df.shape
family_df["no_of_family_members"] = family_df["no_of_adults"] + family_df["no_of_children"]
family_df.head()
stacked_barplot(family_df, 'no_of_family_members', 'booking_status')
Observations:
stay_df = df[(df["no_of_week_nights"] > 0) & (df["no_of_weekend_nights"] > 0)]
stay_df.shape
stay_df["total_days"] = stay_df["no_of_week_nights"] + stay_df["no_of_weekend_nights"]
stay_df.head()
stacked_barplot(stay_df, 'total_days', 'booking_status')
Observations:
stacked_barplot(df, 'repeated_guest', 'booking_status')
Observations:
monthly_data = df.groupby(["arrival_month"])["booking_status"].count()
monthly_data
monthly_data = pd.DataFrame({"Month": list(monthly_data.index), "Guests": list(monthly_data.values)})
plt.figure(figsize=(10, 5))
sns.lineplot(data=monthly_data, x="Month", y="Guests")
plt.grid()
plt.show()
stacked_barplot(df, 'arrival_month', 'booking_status')
Observations:
# Box plot for showing distribution of room prices per month:
plt.figure(figsize = (8,6))
sns.boxplot(data= df, x='arrival_month', y='avg_price_per_room')
plt.show()
Observations:
stacked_barplot(df, 'room_type_reserved', 'repeated_guest')
Observations:
# Box plot for showing distribution of no_of_previous_bookings_not_canceled vs repeated_guests:
plt.figure(figsize = (8,6))
sns.boxplot(data= df, x='repeated_guest', y='no_of_previous_bookings_not_canceled')
plt.show()
Observations:
labeled_barplot(df, 'arrival_month', perc=True)
df['arrival_month'].value_counts()
Observations:
| Month | No. of reservations | Percentage of reservations (%) |
|---|---|---|
| October | 5317 | 14.7 |
| September | 4611 | 12.7 |
| August | 3813 | 10.5 |
| June | 3203 | 8.8 |
| December | 3021 | 8.3 |
| Month | No. of reservations | Percentage of reservations (%) |
|---|---|---|
| January | 1014 | 2.8 |
| February | 1704 | 4.7 |
| March | 2358 | 6.5 |
| May | 2598 | 7.2 |
| April | 2736 | 7.2 |
labeled_barplot(df, 'market_segment_type', perc=True, rotatn=90)
market_seg_df = df.groupby(['market_segment_type']).agg(no=('market_segment_type','count')).reset_index()
market_seg_df['percent'] = market_seg_df['no']/df.shape[0] *100
market_seg_df
Observations:
sns.lineplot(data=df, x='market_segment_type', y='avg_price_per_room')
plt.grid()
plt.show()
market_seg_price_df = df.groupby(['market_segment_type']).agg(average_price=('avg_price_per_room','mean')).reset_index()
market_seg_price_df.sort_values(by='average_price', ascending= False)
Observations:
| Market Segment | Average Price per Room per day (euros) |
|---|---|
| Online | 112.26 |
| Aviation | 100.70 |
| Offline | 91.63 |
| Corporate | 82.91 |
| Complementary | 3.14 |
labeled_barplot(df, 'booking_status', perc=True)
df.groupby(['booking_status'])['booking_status'].count()
df.groupby(['booking_status'])['booking_status'].count()/df.shape[0] *100
Observations:
# catplot creates its own figure, so a preceding plt.figure call is not needed:
sns.catplot(data=df, x='repeated_guest', hue='booking_status', kind='count', height=5, aspect=0.8)
plt.show()
guests_df = df.groupby(['repeated_guest','booking_status']).agg(number=('repeated_guest','count')).reset_index()
guests_df
repeated_guests_df = guests_df.query('repeated_guest==1')
repeated_guests_df['percent'] = repeated_guests_df['number']/repeated_guests_df['number'].sum() *100
repeated_guests_df
Observations:
# catplot creates its own figure, so a preceding plt.figure call is not needed:
sns.catplot(data=df, x='no_of_special_requests', hue='booking_status', kind='count', height=5, aspect=0.8)
plt.show()
stacked_barplot(df, "no_of_special_requests", "booking_status")
sp_request_df = df.groupby(['no_of_special_requests','booking_status']).agg(number=('no_of_special_requests','count')).reset_index()
sp_request_df
special_req_0 = sp_request_df.query('no_of_special_requests == 0')
special_req_0['percent'] = special_req_0['number']/special_req_0['number'].sum() *100
print(special_req_0)
percent_0_req = round(special_req_0.loc[1,'percent'],2)
special_req_1 = sp_request_df.query('no_of_special_requests == 1')
special_req_1['percent'] = special_req_1['number']/special_req_1['number'].sum() *100
print(special_req_1)
percent_1_req = round(special_req_1.loc[3,'percent'],2)
special_req_2 = sp_request_df.query('no_of_special_requests == 2')
special_req_2['percent'] = special_req_2['number']/special_req_2['number'].sum() *100
print(special_req_2)
percent_2_req = round(special_req_2.loc[5,'percent'],2)
special_req_3 = sp_request_df.query('no_of_special_requests == 3')
special_req_3['percent'] = special_req_3['number']/special_req_3['number'].sum() *100
print(special_req_3)
special_req_4 = sp_request_df.query('no_of_special_requests == 4')
special_req_4['percent'] = special_req_4['number']/special_req_4['number'].sum() *100
print(special_req_4)
special_req_5 = sp_request_df.query('no_of_special_requests == 5')
special_req_5['percent'] = special_req_5['number']/special_req_5['number'].sum() *100
print(special_req_5)
print(f'Out of the reservations made which had no special requests, {percent_0_req}% of the reservations were cancelled.')
print(f'Out of the reservations made which had 1 special request, {percent_1_req}% of the reservations were cancelled.')
print(f'Out of the reservations made which had 2 special requests, {percent_2_req}% of the reservations were cancelled.')
Observations:
| No.of special request | Percentage of canceled reservations (%) |
|---|---|
| 0 | 43.21 |
| 1 | 23.77 |
| 2 | 14.6 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
Since there are no missing values, we do not have to carry out missing-value treatment.
# outlier detection using boxplot
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
# dropping booking_status
numeric_columns.remove("booking_status")
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Observations:
# Checking no of records that have avg_price_per_room > 500:
df.loc[df["avg_price_per_room"] >500].shape[0]
# Calculating upper whisker value for box plot for 'avg_price_per_room' column:
q1 = df["avg_price_per_room"].quantile(0.25)
q3 = df["avg_price_per_room"].quantile(0.75)
IQR = q3 - q1
upper_Whisker = q3 + 1.5 * IQR
upper_Whisker
# Capping avg_price_per_room values of 500 or above at the upper whisker value:
df.loc[df["avg_price_per_room"] >= 500, "avg_price_per_room"] = upper_Whisker
# Replacing records with no_of_children = 9 and 10 with 3:
df["no_of_children"] = df["no_of_children"].replace([9, 10], 3)
# Checking no of records that have 0 no_of_week_nights and 0 no_of_weekend_nights:
no_days_df = df.loc[(df['no_of_weekend_nights']==0) & (df['no_of_week_nights']==0)]
remove_index = no_days_df.index.to_list()
# Dropping records that have 0 no_of_week_nights and 0 no_of_weekend_nights:
df.drop(index=remove_index, inplace=True)
df.loc[(df['no_of_weekend_nights']==0) & (df['no_of_week_nights']==0)]
Since 'booking_status' is a flag column indicating whether the booking was canceled or not, we changed the 'Not_Canceled' values to 0 and 'Canceled' values to 1 for further analysis.
This step has already been carried out before the EDA section.
df_log_reg = df.copy()
df_dtree = df.copy()
To check whether performing outlier treatment and dropping rows has resulted in any major changes to the dataset.
df.describe().T
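A compact way to verify the treatment is to line up the before/after summary statistics of a column side by side. This is a sketch: in the notebook, `data` would be the raw frame and `df` the treated one; here, tiny hypothetical stand-in frames are used so the snippet runs standalone.

```python
import pandas as pd

# Tiny stand-ins for the raw ('data') and treated ('df') frames (hypothetical values):
data = pd.DataFrame({"avg_price_per_room": [80.0, 99.0, 120.0, 540.0]})
df = pd.DataFrame({"avg_price_per_room": [80.0, 99.0, 120.0, 179.55]})

# Line up the before/after summary statistics side by side:
comparison = pd.concat(
    [
        data["avg_price_per_room"].describe().rename("before"),
        df["avg_price_per_room"].describe().rename("after"),
    ],
    axis=1,
)
print(comparison)  # the 'max' row shows the effect of capping the extreme price
```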
Observations:
| Variables | Value before outlier treatment | Value after outlier treatment |
|---|---|---|
| mean() : avg_price_per_room | 103.42354 | 103.63645 |
| std() : avg_price_per_room | 35.08942 | 34.72348 |
| 25%() : avg_price_per_room | 80.30000 | 80.75000 |
| 50%() : avg_price_per_room | 99.45000 | 99.60000 |
| max() : avg_price_per_room | 540.00 | 375.50 |
| mean() : no_of_children | 0.10528 | 0.39472 |
| std() : no_of_children | 0.40265 | 0.39472 |
| max() : no_of_children | 10 | 3 |
| mean() : no_of_weekend_nights | 0.81072 | 0.81247 |
| std() : no_of_weekend_nights | 0.87064 | 0.87077 |
| mean() : no_of_week_nights | 2.20430 | 2.20905 |
| std() : no_of_week_nights | 1.41090 | 1.40870 |
# Histplot and Boxplot to show the distribution of data for the column 'avg_price_per_room':
creating_hist_box(df, 'avg_price_per_room', bins=40)
df['avg_price_per_room'].describe()
Observations:
# Barplot for the column 'no_of_children' in the dataset:
labeled_barplot(df, 'no_of_children', perc=True)
df['no_of_children'].value_counts()/df.shape[0] * 100
Observations:
# Barplot for the column 'no_of_week_nights' in the dataset:
labeled_barplot(df, 'no_of_week_nights', perc=True)
(df['no_of_week_nights'].value_counts()/df.shape[0] * 100).head()
(df['no_of_week_nights'].value_counts()/df.shape[0] * 100).tail()
Observations:
# Barplot for the column 'no_of_weekend_nights' in the dataset:
labeled_barplot(df, 'no_of_weekend_nights', perc=True)
(df['no_of_weekend_nights'].value_counts()/df.shape[0] * 100)
Observations:
# Barplot for the column 'booking_status' in the dataset:
labeled_barplot(df, 'booking_status', perc=True)
Observations:
cols_list = df.select_dtypes(include=np.number).columns.tolist()
df_corr = df[cols_list].corr()
df_corr
#Heatmap showing correlation values between different variables.
plt.figure(figsize=(12, 7))
sns.heatmap(df_corr, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Observations:
| Variables | Corr. value before outlier treatment | Corr. value after outlier treatment |
|---|---|---|
| avg_price_per_room & repeated_guest | -0.17 | -0.18 |
| required_parking_space & no_of_children | 0.03 | 0.04 |
| arrival_year & no_of_weekend_nights | 0.06 | 0.05 |
| avg_price_per_room & no_of_children | 0.34 | 0.35 |
| avg_price_per_room & no_of_weekend_nights | -0.00 | -0.01 |
| avg_price_per_room & no_of_week_nights | 0.02 | 0.01 |
| avg_price_per_room & lead_time | -0.06 | -0.07 |
| avg_price_per_room & arrival_month | 0.05 | 0.06 |
| avg_price_per_room & no_of_previous_bookings_not_canceled | -0.11 | -0.12 |
| no_of_special_requests & no_of_children | 0.12 | 0.13 |
| no_of_special_requests & avg_price_per_room | 0.18 | 0.19 |
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow")
plt.show()
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x="market_segment_type", y="avg_price_per_room")
plt.grid()
plt.show()
df.groupby(['market_segment_type'])['avg_price_per_room'].mean()
Observations:
plt.figure(figsize=(10, 5))
sns.boxplot(data = df, x='no_of_special_requests', y='avg_price_per_room')
plt.show()
plt.figure(figsize=(10, 6))
sns.lineplot(data=df, x="no_of_special_requests", y="avg_price_per_room", ci=None)
plt.grid()
plt.show()
df.groupby(['no_of_special_requests'])['avg_price_per_room'].mean()
Observations:
distribution_plot_wrt_target(df, "avg_price_per_room", "booking_status")
df.loc[df['booking_status'] ==0,'avg_price_per_room'].describe()
df.loc[df['booking_status'] ==1,'avg_price_per_room'].describe()
Observations:
To check whether performing outlier treatment and dropping rows has resulted in any major changes.
labeled_barplot(df, 'arrival_month', perc=True)
df['arrival_month'].value_counts()
Observations:
| Arrival Month | Percentage of reservations before dropping records (%) | Percentage of reservations after dropping records (%) |
|---|---|---|
| October | 14.7 | 14.6 |
| July | 8.0 | 1.1 |
labeled_barplot(df, 'market_segment_type', perc=True, rotatn=90)
market_seg_df = df.groupby(['market_segment_type']).agg(no=('market_segment_type','count')).reset_index()
market_seg_df['percent'] = market_seg_df['no']/df.shape[0] *100
market_seg_df
Observations:
| Market Type | Percentage of reservations before dropping records (%) | Percentage of reservations after dropping records (%) |
|---|---|---|
| Offline | 29.0 | 29.1 |
| Complementary | 1.1 | 1.0 |
sns.lineplot(data=df, x='market_segment_type', y='avg_price_per_room')
plt.grid()
plt.show()
market_seg_price_df = df.groupby(['market_segment_type']).agg(average_price=('avg_price_per_room','mean')).reset_index()
market_seg_price_df.sort_values(by='average_price', ascending= False)
Observations:
| Market Type | Average Price per Room before dropping records (euros) | Average Price per Room after dropping records (euros) |
|---|---|---|
| Online | 112.26 | 112.57 |
| Complementary | 3.14 | 3.25 |
labeled_barplot(df, 'booking_status', perc=True)
df.groupby(['booking_status'])['booking_status'].count()
df.groupby(['booking_status'])['booking_status'].count()/df.shape[0] *100
Observations:
| Booking Status | Percentage of reservations before dropping records (%) | Percentage of reservations after dropping records (%) |
|---|---|---|
| Not Canceled | 67.24 | 67.17 |
| Canceled | 32.76 | 32.83 |
# catplot creates its own figure, so a preceding plt.figure call is not needed:
sns.catplot(data=df, x='repeated_guest', hue='booking_status', kind='count', height=5, aspect=0.8)
plt.show()
guests_df = df.groupby(['repeated_guest','booking_status']).agg(number=('repeated_guest','count')).reset_index()
guests_df
repeated_guests_df = guests_df.query('repeated_guest==1')
repeated_guests_df['percent'] = repeated_guests_df['number']/repeated_guests_df['number'].sum() *100
repeated_guests_df
Observations:
# catplot creates its own figure, so a preceding plt.figure call is not needed:
sns.catplot(data=df, x='no_of_special_requests', hue='booking_status', kind='count', height=5, aspect=0.8)
plt.show()
stacked_barplot(df, "no_of_special_requests", "booking_status")
sp_request_df = df.groupby(['no_of_special_requests','booking_status']).agg(number=('no_of_special_requests','count')).reset_index()
sp_request_df
# Cancellation share for each number of special requests
# (selecting the canceled rows by value rather than by hard-coded index labels):
cancel_pct = {}
for n in sorted(sp_request_df['no_of_special_requests'].unique()):
    subset = sp_request_df.query('no_of_special_requests == @n').copy()
    subset['percent'] = subset['number'] / subset['number'].sum() * 100
    print(subset)
    canceled = subset.loc[subset['booking_status'].isin([1, 'Canceled']), 'percent']
    if not canceled.empty:
        cancel_pct[n] = round(canceled.iloc[0], 2)
for n in (0, 1, 2):
    print(f'Out of the reservations which had {n} special request(s), {cancel_pct[n]}% were cancelled.')
Observations:
| No. of Special Requests | Canceled share before dropping records (%) | Canceled share after dropping records (%) |
|---|---|---|
| 0 | 43.21 | 43.27 |
| 1 | 23.77 | 23.84 |
| 2 | 14.60 | 14.63 |
The model can make correct predictions when:
The model can make wrong predictions when:
Which case is more important?
How to reduce the losses?
F1 Score is to be maximized: the greater the F1 score, the better both False Negatives and False Positives are kept in check.
# Splitting data into independent(X) and dependent(y) variables:
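Since F1 is the harmonic mean of precision and recall, a high F1 requires both few false positives and few false negatives. A minimal sketch on toy labels (values hypothetical, for illustration only):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions: 2 TP, 2 FN, 1 FP, 3 TN
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2 / (2 + 1)
r = recall_score(y_true, y_pred)     # 2 / (2 + 2)
f1 = f1_score(y_true, y_pred)        # harmonic mean: 2*p*r / (p + r)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Either kind of error pulls one of the two components down, and the harmonic mean penalizes the lower of the two.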
X = df_log_reg.drop(["booking_status"], axis=1)
Y = df_log_reg["booking_status"]
# Adding constant to independent variable X:
X = sm.add_constant(X)
# Creating dummy columns:
X = pd.get_dummies(X,drop_first=True)
X.info()
Observations:
# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1, stratify=Y)
# Train data shape:
print(f'Indep training df: {X_train.shape}')
print(f'Dep training df: {y_train.shape}')
# Test Data shape:
print(f'Indep test df: {X_test.shape}')
print(f'Dep test df: {y_test.shape}')
Observations:
# Percentage of classes in dataset:
# booking_status = 0 = Not_Canceled
# booking_status = 1 = Canceled
df_log_reg['booking_status'].value_counts()/df_log_reg.shape[0]
# Percentage of classes in training dataset:
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
# Percentage of classes in test dataset:
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
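The effect of `stratify` can be seen on a toy target (proportions hypothetical): both splits reproduce the original class ratio exactly.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80% class 0, 20% class 1 (hypothetical proportions)
y_toy = pd.Series([0] * 80 + [1] * 20)
X_toy = pd.DataFrame({"x": np.arange(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)
# Both splits keep the 80/20 ratio
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```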
Observations:
# Building Model:
model_lg_1 = sm.Logit(y_train, X_train.astype(float)).fit(disp=False)
print(model_lg_1.summary())
Observations:
# Prediction on training set
# Default Threshold is 0.5, if predicted probability is greater than 0.5 the observation will be classified as 1
pred_train = (model_lg_1.predict(X_train) > 0.5).astype(int)
pred_train.head()
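The 0.5 cut-off simply binarizes predicted probabilities; a toy illustration with hypothetical probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities of cancellation
probs = np.array([0.10, 0.45, 0.51, 0.90])
labels = (probs > 0.5).astype(int)
print(labels)  # [0 0 1 1]
```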
# Creating confusion matrix for logistic regression model 'model_lg_1' on training data:
# Here, threshold value for classification is 0.5
confusion_matrix_statsmodels(model_lg_1, X_train, y_train)
print("Training performance:")
model_performance_classification_statsmodels(model_lg_1, X_train, y_train)
# Creating confusion matrix for logistic regression model 'model_lg_1' on testing data:
# Here, threshold value for classification is 0.5
confusion_matrix_statsmodels(model_lg_1, X_test, y_test)
print("Testing performance:")
model_performance_classification_statsmodels(model_lg_1, X_test, y_test)
Observation on 'model_lg_1':
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 5303 | 20.93 |
| True Negative | 15199 | 59.99 |
| False Positive | 1820 | 7.18 |
| False Negative | 3015 | 11.90 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2244 | 20.66 |
| True Negative | 6429 | 59.20 |
| False Positive | 866 | 7.97 |
| False Negative | 1321 | 12.16 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.80917 |
| Test | 0.79862 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.63753 |
| Test | 0.62945 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.68687 |
| Test | 0.67236 |
vif = checking_vif(X_train)
vif
# Checking which columns have VIF >= 5:
high_vif_cols = vif.loc[(vif['VIF'] >= 5) & (vif['feature'] != 'const')]
column_list = high_vif_cols['feature'].tolist()
column_list
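The `checking_vif` helper is defined earlier in the notebook; a minimal sketch of such a function using statsmodels is below (the name `checking_vif_sketch` and the toy data are illustrative, not the notebook's exact helper):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def checking_vif_sketch(predictors):
    """Return the variance inflation factor for each column of `predictors`."""
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(predictors.shape[1])
    ]
    return vif

# Toy data: x2 is almost an exact multiple of x1, so both get a high VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({"const": 1.0, "x1": x1,
                    "x2": 2 * x1 + rng.normal(scale=0.1, size=200)})
print(checking_vif_sketch(toy))
```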
Observation:
The steps for dropping the high p-value variables are:
# The code below automates the above-mentioned three steps for dropping high p-value variables:
# initial list of columns
cols = X_train.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    x_train_aux = X_train[cols]
    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)
    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)
    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()
    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
# X_train2 contains the data from all the columns whose p_value<0.05.
# We will use this to make our final model
X_train2 = X_train[selected_features]
# Rebuilding model (leaving out columns that have p_value >0.05)
model_lg_2 = sm.Logit(y_train, X_train2).fit(disp=False)
print(model_lg_2.summary())
# Predicting on training set
# Default Threshold is 0.5, if predicted probability is greater than 0.5 the observation will be classified as 1
pred_train = (model_lg_2.predict(X_train2) > 0.5).astype(int)
pred_train.head()
# Creating confusion matrix for logistic regression model 'model_lg_2' on training data:
# Here, threshold value for classification is 0.5
confusion_matrix_statsmodels(model_lg_2, X_train2, y_train)
log_reg_model_train_perf = model_performance_classification_statsmodels(model_lg_2, X_train2, y_train)
print("Training performance:")
log_reg_model_train_perf
X_test2 = X_test[selected_features]
# Creating confusion matrix for logistic regression model 'model_lg_2' on test data:
# Here, threshold value for classification is 0.5
confusion_matrix_statsmodels(model_lg_2, X_test2, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(model_lg_2, X_test2, y_test)
print("Testing performance:")
log_reg_model_test_perf
Observations on model_lg_2 :
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 5292 | 20.89 |
| True Negative | 15205 | 60.01 |
| False Positive | 1814 | 7.16 |
| False Negative | 3026 | 11.94 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2245 | 20.67 |
| True Negative | 6433 | 59.24 |
| False Positive | 862 | 7.94 |
| False Negative | 1320 | 12.15 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.80898 |
| Test | 0.79908 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.63621 |
| Test | 0.62973 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.68620 |
| Test | 0.67296 |
#Plotting ROC-AUC curve for training data:
logit_roc_auc_train = roc_auc_score(y_train, model_lg_2.predict(X_train2))
fpr, tpr, thresholds = roc_curve(y_train, model_lg_2.predict(X_train2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic for training dataset")
plt.legend(loc="lower right")
plt.show()
# Plotting ROC-AUC curve for test data:
logit_roc_auc_test = roc_auc_score(y_test, model_lg_2.predict(X_test2))
fpr, tpr, thresholds = roc_curve(y_test, model_lg_2.predict(X_test2))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic for test dataset")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, model_lg_2.predict(X_train2))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_idx)
print(optimal_threshold_auc_roc)
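`argmax(tpr - fpr)` is Youden's J statistic. On toy scores (hypothetical labels and probabilities) the selected cut-off is the one that separates the classes best:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted scores
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9])

fpr_t, tpr_t, thr_t = roc_curve(y_toy, scores)
best_threshold = thr_t[np.argmax(tpr_t - fpr_t)]
print(best_threshold)  # 0.7: catches 3 of 4 positives with zero false positives
```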
# Creating confusion matrix for logistic regression model 'model_lg_2' on train data:
# Here, threshold value for classification is 0.38 (i.e., optimal_threshold_auc_roc)
confusion_matrix_statsmodels(model_lg_2, X_train2, y_train, threshold=optimal_threshold_auc_roc)
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
model_lg_2, X_train2, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
# Creating confusion matrix for logistic regression model 'model_lg_2' on test data:
# Here, threshold value for classification is 0.38 (i.e., optimal_threshold_auc_roc)
confusion_matrix_statsmodels(model_lg_2, X_test2, y_test, threshold=optimal_threshold_auc_roc)
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
model_lg_2, X_test2, y_test, threshold=optimal_threshold_auc_roc
)
print("Testing performance:")
log_reg_model_test_perf_threshold_auc_roc
Observations for 'model_lg_2' (with threshold = 0.38):
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 6052 | 23.89 |
| True Negative | 14194 | 56.02 |
| False Positive | 2825 | 11.15 |
| False Negative | 2266 | 8.94 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2586 | 23.81 |
| True Negative | 6006 | 55.30 |
| False Positive | 1289 | 11.87 |
| False Negative | 979 | 9.01 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.79907 |
| Test | 0.79116 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.72758 |
| Test | 0.72539 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.70393 |
| Test | 0.69516 |
# Plotting Precision-Recall Curves:
y_scores = model_lg_2.predict(X_train2)
prec, rec, tre = precision_recall_curve(y_train, y_scores)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
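One common way to read the curves above is to pick the threshold where precision and recall are closest, i.e. their crossing point. A sketch on hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted scores
y_toy = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]

prec_t, rec_t, thr_t = precision_recall_curve(y_toy, scores)
# Index of the threshold where |precision - recall| is smallest
idx = np.argmin(np.abs(prec_t[:-1] - rec_t[:-1]))
print(thr_t[idx])  # 0.5, where precision == recall == 0.75
```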
# setting the threshold
optimal_threshold_curve1 = 0.37
# Creating confusion matrix for logistic regression model 'model_lg_2' on train data:
# Here, threshold value for classification is 0.37 (i.e., optimal_threshold_curve1)
confusion_matrix_statsmodels(model_lg_2, X_train2, y_train, threshold=optimal_threshold_curve1)
log_reg_model_train_perf_threshold_curve1 = model_performance_classification_statsmodels(
model_lg_2, X_train2, y_train, threshold=optimal_threshold_curve1
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve1
# Creating confusion matrix for logistic regression model 'model_lg_2' on test data:
# Here, threshold value for classification is 0.37 (i.e., optimal_threshold_curve1)
confusion_matrix_statsmodels(model_lg_2, X_test2, y_test, threshold=optimal_threshold_curve1)
log_reg_model_test_perf_threshold_curve1 = model_performance_classification_statsmodels(
model_lg_2, X_test2, y_test, threshold=optimal_threshold_curve1
)
print("Testing performance:")
log_reg_model_test_perf_threshold_curve1
Observations on 'model_lg_2' (with threshold = 0.37):
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 6133 | 24.21 |
| True Negative | 13977 | 55.16 |
| False Positive | 3042 | 12.01 |
| False Negative | 2158 | 8.62 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2607 | 24.01 |
| True Negative | 5919 | 54.50 |
| False Positive | 1376 | 12.67 |
| False Negative | 958 | 8.82 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.79370 |
| Test | 0.78508 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.73732 |
| Test | 0.73128 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.70119 |
| Test | 0.69078 |
# Setting the optimal threshold from precision-recall curve:
optimal_threshold_curve2 = 0.42
# Creating confusion matrix for logistic regression model 'model_lg_2' on training data:
# Here, threshold value for classification is 0.42 (i.e., optimal_threshold_curve2)
confusion_matrix_statsmodels(model_lg_2, X_train2, y_train, threshold=optimal_threshold_curve2)
log_reg_model_train_perf_threshold_curve2 = model_performance_classification_statsmodels(
model_lg_2, X_train2, y_train, threshold=optimal_threshold_curve2
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve2
# Creating confusion matrix for logistic regression model 'model_lg_2' on test data:
# Here, threshold value for classification is 0.42 (i.e., optimal_threshold_curve2)
confusion_matrix_statsmodels(model_lg_2, X_test2, y_test, threshold=optimal_threshold_curve2)
log_reg_model_test_perf_threshold_curve2 = model_performance_classification_statsmodels(
model_lg_2, X_test2, y_test, threshold=optimal_threshold_curve2
)
print("Testing performance:")
log_reg_model_test_perf_threshold_curve2
Observations:
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 5829 | 23.01 |
| True Negative | 14522 | 57.32 |
| False Positive | 2497 | 9.86 |
| False Negative | 2489 | 9.82 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2502 | 23.04 |
| True Negative | 6155 | 56.68 |
| False Positive | 1140 | 10.50 |
| False Negative | 1063 | 9.79 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.80321 |
| Test | 0.79715 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.70077 |
| Test | 0.70182 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.70043 |
| Test | 0.69432 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_curve1.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve2.T,
],
axis=1,)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.38 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_curve1.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve2.T,
],
axis=1,)
models_test_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.37 Threshold",
"Logistic Regression-0.38 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Testing performance comparison:")
models_test_comp_df
Observations:
# converting coefficients to odds
odds = np.exp(model_lg_2.params)
# finding the percentage change
perc_change_odds = round((np.exp(model_lg_2.params) - 1) * 100,2)
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# Adding the odds to a data frame
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train2.columns).T
Coefficient interpretations:
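As a numeric illustration of the conversion used above (the coefficient value is hypothetical): exp of a logistic-regression coefficient is the odds ratio per unit increase of that feature, and subtracting 1 expresses it as a percentage change in odds.

```python
import numpy as np

# Hypothetical logistic regression coefficient for some feature
b = 0.7
odds_ratio = np.exp(b)               # multiplicative change in odds per unit increase
pct_change = (odds_ratio - 1) * 100  # same change expressed in percent
print(round(odds_ratio, 3), round(pct_change, 2))  # 2.014 101.38
```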
# Converting columns with 'object' dtype to pandas categorical columns:
for feature in df_dtree.columns:
    if df_dtree[feature].dtype == 'object':
        # categories are stored internally as integer codes
        df_dtree[feature] = pd.Categorical(df_dtree[feature])
df_dtree['type_of_meal_plan'].dtype
df_dtree['room_type_reserved'].dtype
df_dtree['market_segment_type'].dtype
# One Hot Encoding columns 'room_type_reserved', 'market_segment_type' and 'type_of_meal_plan'.
oneHotCols=['room_type_reserved', 'market_segment_type','type_of_meal_plan']
# Creating dummy columns:
df_dtree = pd.get_dummies(df_dtree, columns=oneHotCols, drop_first=True)
df_dtree.info()
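The effect of `drop_first=True` on a toy categorical column (category names hypothetical): the alphabetically first category becomes the implicit baseline and gets no dummy column.

```python
import pandas as pd

toy = pd.DataFrame({"meal": ["Plan1", "Plan2", "Plan1", "NotSelected"]})
dummies = pd.get_dummies(toy, columns=["meal"], drop_first=True)
# 'NotSelected' (first alphabetically) is dropped as the baseline level
print(list(dummies.columns))  # ['meal_Plan1', 'meal_Plan2']
```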
# Splitting data into independent(X) and dependent(y) variables:
X = df_dtree.drop("booking_status" , axis=1)
y = df_dtree['booking_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
# Train data shape:
print(f'Indep training df: {X_train.shape}')
print(f'Dep training df: {y_train.shape}')
# Test Data shape:
print(f'Indep test df: {X_test.shape}')
print(f'Dep test df: {y_test.shape}')
Observations:
# Percentage of classes in dataset:
# booking_status = 0 = Not_Canceled
# booking_status = 1 = Canceled
df_dtree['booking_status'].value_counts()/df_dtree.shape[0]
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Observations:
Around 67% of observations belong to class 0 (Booking status = Not_Canceled) and 33% of observations belong to class 1 (Booking status = Canceled), and this is preserved in the train and test sets.
# Building decision tree model :
model_dtree1 = DecisionTreeClassifier(random_state=1)
model_dtree1.fit(X_train, y_train)
print('Accuracy on training set : ', model_dtree1.score(X_train, y_train))
print('Accuracy on test set : ', model_dtree1.score(X_test, y_test))
# Confusion matrix for model 'model_dtree1' on training set:
confusion_matrix_sklearn(model_dtree1, X_train, y_train)
dtree_perf_train = model_performance_classification_sklearn(model_dtree1, X_train, y_train)
print('Training data performance:')
dtree_perf_train
# Confusion matrix for model 'model_dtree1' on testing set:
confusion_matrix_sklearn(model_dtree1, X_test, y_test)
dtree_perf_test = model_performance_classification_sklearn(model_dtree1, X_test, y_test)
print('Testing data performance:')
dtree_perf_test
Observations on model_dtree1:
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 8158 | 32.20 |
| True Negative | 17032 | 67.22 |
| False Positive | 37 | 0.15 |
| False Negative | 110 | 0.43 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2883 | 26.55 |
| True Negative | 6350 | 60.13 |
| False Positive | 715 | 6.58 |
| False Negative | 732 | 6.74 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.99420 |
| Test | 0.86676 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.98670 |
| Test | 0.79751 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.99107 |
| Test | 0.79939 |
# Printing the features:
feature_names = list(X.columns)
print(feature_names)
# Text report showing the rules of the decision tree model 'model_dtree1':
print(tree.export_text(model_dtree1,feature_names=feature_names,show_weights=True))
# Checking important features in 'model_dtree1':
feature_names = list(X_train.columns)
importances = model_dtree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Importance of features of model 'model_dtree1'
print (pd.DataFrame(model_dtree1.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Observation:
# Creating decision tree model using pre-pruning:
# Here, max_depth is set to 3
model_dtree4 = DecisionTreeClassifier(max_depth =3, random_state=1, class_weight="balanced")
model_dtree4.fit(X_train, y_train)
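`class_weight="balanced"` reweights classes inversely to their frequency. A sketch of the weights sklearn computes for a roughly 67/33 split like this dataset's (toy counts, for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target mirroring the ~67/33 class split in this data
y_toy = np.array([0] * 67 + [1] * 33)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_toy)
# weight = n_samples / (n_classes * class_count)
print(weights)  # [100/(2*67), 100/(2*33)] -> the minority class weighs ~2x more
```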
# Confusion matrix for model 'model_dtree4' on training data:
confusion_matrix_sklearn(model_dtree4, X_train, y_train)
dtree_tune3_perf_train = model_performance_classification_sklearn(model_dtree4, X_train, y_train)
print('Training data performance:')
dtree_tune3_perf_train
# Confusion matrix for model 'model_dtree4' on test data:
confusion_matrix_sklearn(model_dtree4, X_test, y_test)
dtree_tune3_perf_test = model_performance_classification_sklearn(model_dtree4, X_test, y_test)
print('Testing data performance:')
dtree_tune3_perf_test
# Visualizing the Decision Tree 'model_dtree4':
plt.figure(figsize=(20, 10))
out = tree.plot_tree(model_dtree4,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree model 'model_dtree4':
print(tree.export_text(model_dtree4,feature_names=feature_names,show_weights=True))
#Checking important features in 'model_dtree4':
feature_names = list(X_train.columns)
importances = model_dtree4.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Importance of features of model 'model_dtree4':
print (pd.DataFrame(model_dtree4.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Observations on 'model_dtree4':
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 6108 | 24.11 |
| True Negative | 13854 | 54.68 |
| False Positive | 3215 | 12.69 |
| False Negative | 2160 | 8.53 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2609 | 24.02 |
| True Negative | 5908 | 54.40 |
| False Positive | 1377 | 12.31 |
| False Negative | 1006 | 9.26 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.78786 |
| Test | 0.78425 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.73875 |
| Test | 0.72172 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.69445 |
| Test | 0.69012 |
# Creating decision tree model using pre-pruning:
# Here, max_depth is set to 5
model_dtree5 = DecisionTreeClassifier(max_depth =5, random_state=1, class_weight="balanced")
model_dtree5.fit(X_train, y_train)
# Confusion matrix for model 'model_dtree5' on training data:
confusion_matrix_sklearn(model_dtree5, X_train, y_train)
dtree_tune5_perf_train = model_performance_classification_sklearn(model_dtree5, X_train, y_train)
print('Training data performance:')
dtree_tune5_perf_train
# Confusion matrix for model 'model_dtree5' on test data:
confusion_matrix_sklearn(model_dtree5, X_test, y_test)
dtree_tune5_perf_test = model_performance_classification_sklearn(model_dtree5, X_test, y_test)
print('Testing data performance:')
dtree_tune5_perf_test
# Visualizing the Decision Tree 'model_dtree5':
plt.figure(figsize=(20, 10))
out = tree.plot_tree(model_dtree5,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree model 'model_dtree5':
print(tree.export_text(model_dtree5,feature_names=feature_names,show_weights=True))
# Checking important features in 'model_dtree5':
feature_names = list(X_train.columns)
importances = model_dtree5.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Importance of features of model 'model_dtree5'
print (pd.DataFrame(model_dtree5.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Observations on model_dtree5:
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 6176 | 24.38 |
| True Negative | 14936 | 58.95 |
| False Positive | 2133 | 8.42 |
| False Negative | 2092 | 8.26 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2653 | 24.43 |
| True Negative | 6362 | 58.58 |
| False Positive | 883 | 8.13 |
| False Negative | 962 | 8.86 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.83325 |
| Test | 0.83011 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.74698 |
| Test | 0.73389 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.74513 |
| Test | 0.74199 |
# Creating Decision Tree model (using Grid Search for Hyperparameter Tuning of the model):
# Choosing the type of classifier:
model_dtree2 = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from:
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Type of scoring used to compare parameter combinations:
f1_scorer = make_scorer(f1_score)
# Run the grid search:
grid_obj = GridSearchCV(model_dtree2, parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
model_dtree2 = grid_obj.best_estimator_
# Fit the best algorithm to the data.
model_dtree2.fit(X_train, y_train)
print('Accuracy on training set : ', round(model_dtree2.score(X_train, y_train),2))
print('Accuracy on test set : ', round(model_dtree2.score(X_test, y_test),2))
Observations:
# Confusion matrix for model 'model_dtree2' on training data:
confusion_matrix_sklearn(model_dtree2, X_train, y_train)
dtree_tune_perf_train = model_performance_classification_sklearn(model_dtree2, X_train, y_train)
print('Training data performance:')
dtree_tune_perf_train
# Confusion matrix for model 'model_dtree2' on test data:
confusion_matrix_sklearn(model_dtree2, X_test, y_test)
dtree_tune_perf_test = model_performance_classification_sklearn(model_dtree2, X_test, y_test)
print('Testing data performance:')
dtree_tune_perf_test
Observations on model_dtree2:
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 6619 | 26.12 |
| True Negative | 14399 | 56.83 |
| False Positive | 2670 | 10.54 |
| False Negative | 1649 | 6.51 |
| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 2863 | 26.36 |
| True Negative | 6122 | 56.37 |
| False Positive | 1123 | 10.34 |
| False Negative | 752 | 6.92 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.82954 |
| Test | 0.82735 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.80056 |
| Test | 0.79198 |
| Type of Dataset | F1 score |
|---|---|
| Train | 0.75400 |
| Test | 0.75332 |
# Visualizing the Decision Tree 'model_dtree2':
plt.figure(figsize=(20, 10))
out = tree.plot_tree(model_dtree2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree model 'model_dtree2':
print(tree.export_text(model_dtree2, feature_names=feature_names, show_weights=True))
#Checking important features in 'model_dtree2':
importances = model_dtree2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Importance of features in model 'model_dtree2':
print (pd.DataFrame(model_dtree2.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Observation:
# Creating Decision Tree model (using Cost Complexity Pruning):
model_dtree3 = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = model_dtree3.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
# Plotting Total Impurity vs effective alpha for training set:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
# Training decision tree using the effective alphas :
clfs = []
for ccp_alpha in ccp_alphas:
    model_dtree3 = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    model_dtree3.fit(X_train, y_train)
    clfs.append(model_dtree3)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
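The alpha-size trade-off can be checked on synthetic data (dataset hypothetical): fitting with a larger effective `ccp_alpha` prunes subtrees away and yields a smaller tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=1)
alphas = DecisionTreeClassifier(random_state=1) \
    .cost_complexity_pruning_path(X_toy, y_toy).ccp_alphas

full = DecisionTreeClassifier(random_state=1, ccp_alpha=0.0).fit(X_toy, y_toy)
# Second-to-last alpha -> heavily pruned (the last alpha collapses to the root)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=alphas[-2]).fit(X_toy, y_toy)
print(full.tree_.node_count, pruned.tree_.node_count)
```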
# Plotting No. of Nodes and Depth of the tree vs alpha:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Observations:
# Plotting accuracy scores for training and test sets:
accuracy_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = accuracy_score(y_train, pred_train)
    accuracy_train.append(values_train)
accuracy_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = accuracy_score(y_test, pred_test)
    accuracy_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Accuracy Score")
ax.set_title("Accuracy Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, accuracy_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, accuracy_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# Plotting recall scores for training and test sets:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall Score")
ax.set_title("Recall Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# Plotting F1 scores for training and test sets:
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)
f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
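The three scoring loops above differ only in the metric used. A minimal single-pass refactor is sketched below on synthetic data; the construction of `clfs` and `ccp_alphas` via `cost_complexity_pruning_path` is an assumption that mirrors how the notebook is presumed to have built them earlier.

```python
# Sketch: compute accuracy, recall, and F1 for each pruned tree in one pass.
# Self-contained on synthetic data; in the notebook, `clfs` and `ccp_alphas`
# are assumed to come from cost_complexity_pruning_path on the real data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# One tree per alpha along the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)
    for alpha in path.ccp_alphas
]

# A single loop fills all three metric lists at once.
scores = {"accuracy": [], "recall": [], "f1": []}
for clf in clfs:
    pred = clf.predict(X_test)
    scores["accuracy"].append(accuracy_score(y_test, pred))
    scores["recall"].append(recall_score(y_test, pred))
    scores["f1"].append(f1_score(y_test, pred))
```

Each list can then be plotted against `ccp_alphas` exactly as in the cells above, without repeating the prediction step per metric.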
# Selecting the model with the highest F1 score on the test set:
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
# Confusion matrix for model 'best_model' on training data:
confusion_matrix_sklearn(best_model, X_train, y_train)
dtree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
print("Training Performance")
dtree_post_perf_train
# Confusion matrix for model 'best_model' on test data:
confusion_matrix_sklearn(best_model, X_test, y_test)
dtree_post_perf_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)
print("Testing Performance")
dtree_post_perf_test
Observations on 'best_model':
Confusion matrix on the training set:

| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 7681 | 30.32 |
| True Negative | 15558 | 61.40 |
| False Positive | 1511 | 5.96 |
| False Negative | 587 | 2.32 |

Confusion matrix on the test set:

| Type | Number of Observations | Percentage of Observations (%) |
|---|---|---|
| True Positive | 3088 | 28.43 |
| True Negative | 6307 | 58.08 |
| False Positive | 938 | 8.64 |
| False Negative | 527 | 4.85 |
| Type of Dataset | Accuracy score |
|---|---|
| Train | 0.91720 |
| Test | 0.86510 |
| Type of Dataset | Recall score |
|---|---|
| Train | 0.92900 |
| Test | 0.85422 |
| Type of Dataset | F1-score |
|---|---|
| Train | 0.87984 |
| Test | 0.80827 |
# Visualizing the Decision Tree for model 'best_model':
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# The code below adds arrows to the decision tree splits if they are missing:
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree model 'best_model':
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
# Checking important features in 'best_model':
feature_names = list(X_train.columns)
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Importance of features in the tree building for model 'best_model':
print(
    pd.DataFrame(
        best_model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
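The impurity-based importances read off a fitted tree are normalized to sum to 1 (whenever the tree has at least one split), so the `Imp` column above can be interpreted as each feature's share of the total impurity reduction. A self-contained sketch on synthetic data (`tree_clf` is illustrative, not the notebook's `best_model`):

```python
# Sketch: feature_importances_ from a fitted decision tree sum to 1 and can
# be ranked with argsort. Synthetic data only, for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
tree_clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

importances = tree_clf.feature_importances_
ranked = np.argsort(importances)[::-1]  # feature indices, most important first
print(importances[ranked])
```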
Observations:
# Training performance comparison
models_train_comp_df = pd.concat(
    [
        dtree_perf_train.T,
        dtree_tune3_perf_train.T,
        dtree_tune5_perf_train.T,
        dtree_tune_perf_train.T,
        dtree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning, max_depth=3)",
    "Decision Tree (Pre-Pruning, max_depth=5)",
    "Decision Tree (Pre-Pruning using GridSearchCV)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
# Testing performance comparison
models_test_comp_df = pd.concat(
    [
        dtree_perf_test.T,
        dtree_tune3_perf_test.T,
        dtree_tune5_perf_test.T,
        dtree_tune_perf_test.T,
        dtree_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning, max_depth=3)",
    "Decision Tree (Pre-Pruning, max_depth=5)",
    "Decision Tree (Pre-Pruning using GridSearchCV)",
    "Decision Tree (Post-Pruning)",
]
print("Testing performance comparison:")
models_test_comp_df
Observations:
* The lead time between the booking date and the arrival date is long (roughly 4.56 months).
* The price of the room is on the higher end.
* The number of special requests made by the guests while booking is low.
* The reservation is made via an online platform.

INN Hotels Group needs to investigate what steps to take when a booking shows the characteristics listed above, so as to reduce the chance of a no-show or cancellation and maximize revenue. The group could also analyze a customer's history of repeated no-shows to justify charging a no-show fee.
INN Hotels Group should also assess whether its website lets new customers navigate and book their stay easily, especially during peak seasons.